A fast kernel independence test for cluster-correlated data

Cluster-correlated data receives a lot of attention in biomedical and longitudinal studies and it is of interest to assess the generalized dependence between two multivariate variables under the cluster-correlated structure. The Hilbert–Schmidt independence criterion (HSIC) is a powerful kernel-based test statistic that captures various dependence between two random vectors and can be applied to an arbitrary non-Euclidean domain. However, the existing HSIC is not directly applicable to cluster-correlated data. Therefore, we propose a HSIC-based test of independence for cluster-correlated data. The new test statistic combines kernel information so that the dependence structure in each cluster is fully considered and exhibits good performance under high dimensions. Moreover, a rapid p value approximation makes the new test fast applicable to large datasets. Numerical studies show that the new approach performs well in both synthetic and real world data.

Measuring general, possibly nonlinear, dependence between two multivariate variables plays a significant role in many scientific applications. For example, assessing the relationship between the overall composition of the microbiome, which includes hundreds of microbial taxa, and various host metabolites from a specific metabolic pathway is often of central interest in many studies [1][2][3] . Determining and understanding the dependence between such variables have successfully provided important clues as to the mechanisms and biological interactions among the variables leading to better understanding of the systems underlying many different conditions.
In the meantime, correlated observations are also frequently obtained in many practical situations. Familybased samples in genome-wide association studies are often used to assess a genetic association to a disease 4 . Repeated/longitudinal observations are also prevalent in biomedical research and the goal of research is to figure out how explanatory variables influence an outcome over time 5 . Within this context of cluster-correlated data, there is also pressing interest in understanding the general dependency between multivariate variables, e.g. the correlation between microbiome composition and metabolic pathways, in longitudinally collected samples.
An example of a study in which we are interested in multivariate dependency in longitudinal samples which also motivates this project is the Menopause Studies-Finding Lasting Answers for Symptoms and Health (MsFLASH) study. MsFLASH was a randomized clinical trial in which women were randomized to one of three arms (two placebo, and one experimental treatment with vaginal estrogen) with the objective of improving symptoms of menopause 6 . The underlying biological hypothesis was that the introduction of estrogen into the vaginal environment would shift the microbiota and result in reduced symptoms. However, despite considerable preliminary research and years of effort, the trial was null and no differences in symptom improvement were identified across the arms. Thus, in a post mortem evaluation of the trial, investigators have been studying why the trial failed despite the preliminary data, including evaluations of whether the underlying hypotheses were correct. Initial work concentrated on the microbiome and they were able to show that, in fact, the microbiome was altered by the introduction of estrogen. Now, a subsequent analysis is focused on whether these microbes are associated with metabolic activity, as one would expected. Ultimately, metabolites are the biochemicals produced by the microbes that should impact symptom development. Thus, a central analytic objective was to evaluate the general dependency between microbiome composition and pre-specified metabolic pathways across time, yet how to optimally conduct this analysis is unclear.
Classical measures of association, such as Pearson correlation 7 , Kendall's τ 8 , and Spearman's ρ 9 , are mainly focused on a simple dependence structure and they could be zero even when two random variables are dependent. As we are entering the big data era, challenging data, both in the dimension and size, is becoming prevalent, and the attention to the association testing method for detecting complex dependence structures is also naturally increasing. Hence, many methods have been proposed to develop tests of independence against general types of alternatives, such as the RV coefficient or its extensions [10][11][12][13] , the distance covariance coefficient or its extensions [14][15][16] , the graph-based test 17 , the rank-based test 18,19 , and the kernel-based test 20,21 .
In particular, kernel-based tests are often utilized to evaluate the association between overall microbiome compositions and outcomes of interest or host gene expressions [22][23][24] . It is well known that kernels can be used to embed the microbiome structure and many different kernels have been developed: UniFrac kernels can accommodate the phylogenetic structure 25 , generalized UniFrac kernels are sensitive to abundance changes in moderately abundant lineages 26 , and the Bray-Curtis kernel quantifies the taxonomic dissimilarity of two microbial communities 22 .
In this paper, we base our approach on the most popular kernel-based test, the Hilbert-Schmidt independence criterion (HSIC), proposed by Gretton et al. 21 . As a nonparametric approach, the HSIC has the potential to capture arbitrary dependence between two random variables. It can be viewed as the distance between the joint distribution and the product of the marginals embedded in a reproducing kernel Hilbert space (RKHS).
However, most kernel-based methods assume that pairs are independent and identically distributed (i.i.d.) and they thus cannot be directly applied to correlated data, particularly clustered data. Moreover, the asymptotic distribution of the HSIC to obtain the threshold of the test given level α of the test is not practical since the null distribution has a complicated form, and cannot be evaluated directly. Therefore, a permutation test is usually preferred in many applications, however, it is computationally prohibitive when the sample size is large or when the alpha level is low, as in the case of our motivating data where we are interested in studying associations between the microbiome and numerous metabolic pathways.
Based on the HSIC, we propose a new test of independence for cluster-correlated data. The new test combines kernel information so that the dependence structure in each cluster is fully considered. Furthermore, compared to other HSIC-based tests that rely on costly Monte Carlo permutation procedures, a closed form of p value approximation is proposed, making the new test much faster and more efficient than the existing tests, particularly for large samples. Numerical studies demonstrate that the new method is powerful under high dimensions in both synthetic and real world data. Our work is related to recent work of Zhan et al. 13 , but differs in that the new test statistic has a computationally more efficient form.
The organization of the paper is as follows. In "Materials and methods" section, we provide our problem setting on clustered data and briefly review the test based on the HSIC. We then propose the new test statistic and the testing procedure for cluster-correlated data. In "Results" section, we examines the performance of the new tests under various simulation settings and the new approach is illustrated by a real data application on vaginal microbiome data. Finally, the discussion is given in "Discussion" section.

Materials and methods
Problem setting. The goal is to test for association between two sets of variables X and Y, such as microbiome composition, gene expression, or profiles of other types of genomic data. Specifically, let X and Y be multivariate random variables with marginal distributions f X on X in R p and f Y on Y in R q , respectively. Let f XY be the joint distribution on X × Y . Then, we aim to test We consider samples of clustered data: observations (X 1 , Y 1 ), . . . , (X N , Y N ) ∈ (X, Y ) of total sample size N are drawn identically from f XY and can be divided into m clusters of size l (i = 1, . . . , m) , that is, Here, ml = N and m clusters are independent from each other while having identical within-cluster correlation structure.
Hilbert-Schmidt independence criterion. The Hilbert-Schmidt independence criterion (HSIC) was first proposed by Gretton et al. 20 . They first map the observations into a reproducing kernel Hilbert space F (RKHS) generated by a given kernel k(·, ·) , that is, for each point x ∈ X , there corresponds an element (feature is a unique positive definite kernel. They then consider a cross-covariance operator between feature maps and the squared Hilbert-Schmidt norm of the cross-covariance operator, which can be expressed as where X ′ and Y ′ are independent copies of X and Y, respectively. Here, when characteristic kernels, such as the Gaussian kernel or Laplacian kernel, are used for k X and k Y , An empirical estimate of HSIC was proposed by Gretton et al. 20 : Let K X and K Y be kernel matrices with entries k X (X i , X j ) and k Y (Y i , Y j ) , respectively. Then, HSIC can be rewritten as www.nature.com/scientificreports/ where K X = H N K X H N and K Y = H N K Y H N are the centered kernel matrices of K X and K Y , respectively, and H N = I N − 1 N 1 t N /N is a centering matrix with I N being an identity matrix of order N and 1 N being a N × 1 vector of all ones.
Gretton et al. 21 studied asymptotic behaviors of HSIC and found that HSIC is degenerate under the null hypothesis of independence. Hence, they proposed a few approaches to approximate it: a Gamma approximation and a permutation approach. Despite the large computational cost, they recommend the permutation approach since the Gamma approximation easily loses power due to a very low variance estimate.
Related works. The HSIC-based test is widely used in many applications since it is powerful and versatile without strong model assumptions and the new test is also in line with this principle. Recently, several approaches have been proposed. For example, Zhan et al. 13 proposed a kernel RV coefficient (KRV) to capture the dependence between two random variables. KRV is a generalized RV coefficient using kernels and it can capture complex relationships, such as nonlinear correlations, among the individual data types. KRV is equivalent to the new test under the permutation null distribution, but the new test has simpler forms since it does not require the standardization.
Recently, Liu et al. 27 proposed the HSIC-based test for cluster-correlated data, denoted by HSIC cl . They derived the asymptotic distribution of HSIC under the null hypothesis of independence between two variables but in the presence of sample correlations. Compared to the HSIC that has an inflated type I error under the clustercorrelated structure, HSIC cl not only controls the type I error well but also performs better than the HSIC. The asymptotic null distribution of HSIC cl is the mixture of chi-square distributions, but the weights are unknown and it should be estimated with empirical counterparts. A Davies' exact method 28 is a way to approximate the asymptotic distribution of HSIC, so the authors adopt this approach. However, the asymptotic null distribution of HSIC cl has more complicated expressions of the weigths than the HSIC and it needs to compute the eigenvalues of a N 2 by m 2 matrix, which provides excessive computational burden for large datasets. Moreover, the Davies' method shows too much conservativeness (see Table 1). To address this, we work under the permutation null distribution and develop a test statistic in a simple manner. Details are in the following section.
HSIC for cluster-correlated data. As discussed in the previous section, given a test statistic, the next question is to determine the critical value of the test with the correct size. The main challenge of the HSIC application is to determine the critical value of the test with the correct size. When using the original HSIC, a major difficulty arises in this step since the asymptotic null distribution of the HSIC is an infinite weighted sum of chi-square random variables and it cannot be applied in practice. Though the Davies' method can be used, as discussed in the previous section, it is computationally expensive and too conservative. Moreover, this is not accurate under the small sample size setting.
To address this, we work under the permutation null distribution and determine whether to reject the null hypothesis or not by the permutation test. The permutation approach does not need to resort to the estimation, asymptotic properties, or any underlying conditions. Hence, the permutation test has been utilized in many applications 29 and the exact cutoff for the test can be obtained from the permutation null distribution. Through N! permutations of shuffling rows and columns of one kernel matrix, the p value can be obtained as the proportion of permuted statistic values greater than or equal to the actual test statistic value. This yields a valid level of the test for finite samples.
Based on the method of obtaining the critical value of the new test under the clustered data setting in the previous section, we now consider testing the null hypothesis of independence H 0 : X ⊥ Y . As discussed in the previous section, the HSIC is the cross-covariance operator in RKHS, but it also can be interpreted as a Euclidean-like distance measure between kernel values under the permutation distribution. To be more specific, the Euclidean-like distance measure between kernel values can be defined as follows: where k X ij and k Y ij are (i, j)-th elements of the kernel matrices K X and K y , respectively. Then, where C is a constant under the permutation. When the kernel matrices are centered, the HSIC is equivalent to the Euclidean-like distance measure between kernel values under the permutation distribution.
One simple way to accommodate cluster-correlated structure is to analyze data at the cluster/subject level, such as utilizing averaged observations at different clusters. However, this could result in loss of information. Moreover, variations across clusters may not be reflected (see Table 3, Fig. 2) To accommodate both differences between kernel values and variations across clusters, we define cluster-wise kernel matrices K cl X and K cl Y . Specifically, we combine kernel information for each cluster by averaging kernel values so that the similarity within and between clusters is well reflected. In other words, the original N × N kernel matrices K X and K Y become m × m cluster-wise kernel matrices K cl X and K cl Y , respectively. Note that K cl X and K cl Y are still symmetric and positive semi-definite. Figure 1 illustrates the formulation of the cluster-wise kernel matrix. For small or moderate sample sizes, we can conduct the permutation test directly and this provides a valid level of the test. However, permutations may be computationally cumbersome when the number of clusters is large. Hence, when the number of clusters is very large, we need to estimate the permutation null distribution of the test statistic. To estimate the p value of the test based on HSIC new without permutations, we propose a moment matching approach using a Pearson type III approximation 30,31 . Specifically, we approximate the permutation null distribution of HSIC new by the Pearson type III distribution. This requires the first three moments of the exact permutation distribution of HSIC new . Let µ , σ 2 , and γ be the mean, variance, and skewness of HSIC new obtained from the permutation null distribution Detail expressions are provided in Supplementary A). Then, the p value of the HSIC new can be analytically computed by the Pearson type III distribution where a = 4/γ 2 , s = σ γ /2 , and = µ − 2σ/γ . We check the efficacy of this approach and it is provided in the following "Results" section.
The choice of kernel and the bandwidth parameter have been studied for two-sample comparison. For example, Gretton et al. 32 studied a linear combination of Gaussian kernels to maximize the power of the test and (6)  www.nature.com/scientificreports/ Ramdas et al. 33 found that, under some conditions, the power of the test based on the Gaussian kernel is independent of the bandwidth when the median heuristic, the median of all pairwise distances among observations, is used. Song and Chen 34 studied the bandwidth choice under the permutation null distribution and showed that the median heuristic is a reasonable choice. Therefore, since the main data variation is well captured by the median heuristic under the permutation null distribution, we propose to use the Gaussian kernel with the median heuristic for the proposed test.

Results
Efficacy of the testing procedure. In this section, we briefly check how accurate the Pearson type III approximation defined in the previous section is compared to the permutation approach as well as the Davies' method 28 . To this end, we observe the empirical type I error rate of the HSIC tests from 1000 simulation runs and compare the performance of the Pearson type III approximation with the permutation approach and the Davies' method for the p-dimensional Gaussian and log-normal data, N p (0 p , �) and log N p (0 p , �) with � (i,j) = 0.4 |i−j| , respectively, under the independent and identically distributed (i.i.d.) setting. Table 1 shows the empirical size of the HSIC tests based on the Pearson type III approximation (Pearson), the permutation approach with 1000 permutations (Perm) and the Davies' method (Davies) under different sample sizes and dimensions. We see that the permutation distribution can be well approximated by the Pearson type III approximation and the Pearson type III approximation in general controls the type I error well, while the Davies' method is too conservative.
We also check how much faster the Pearson type III approximation is compared to the permutation approach with 1000 permutations (perm=1000) and 10,000 permutations (perm=10,000). Notice that the permutation approach becomes more accurate as the number of permutations increases, which increases the computational time as well. Table 2 shows average runtimes for each sample size when p = q = 100 . In comparison to the permutation approach, we see that the Pearson type III approximation can save a significant amount of computational cost.
Power analysis. We now examine the performance of the new test through simulations. We compare the new test with the existing HSIC-based test for cluster-correlated data proposed by Liu et al. 27 , denoted by HSIC cl , and the original HSIC. Here, we follow the simulation setup in Liu et al. 27 for power comparison. In addition, we check the computational cost of the tests.
Specifically, we generate m clusters from the p-dimensional (p = q) Gaussian data: where is the Kronecker product. Here, we fix the cluster size 3 for all i = 1, . . . , m . We also consider an exchangeable correlation structure W across p variables in X and AR(1) correlation structure c across three time points.
We use the Gaussian kernel and the median heuristic bandwidth. We simulate 1000 datasets and the significance level is set to be 0.05. Table 3 shows the empirical size of tests at 0.05 significance level by 1000 simulation runs under different dimensions (p = q) and within-cluster correlations (ρ c ) when m = 100 and ρ W = 0.5 . Corresponding standard errors are provided in Supplementary D. We see that the original HSIC does not control the type I error at all and the inflation increases as the within-cluster correlation increases. HSIC cl is too conservative. In contrast, HSIC new controls type I error well.
To compare the power of the tests, we choose one exposure variable from X at random as the causal exposure, and make the first η proportion of the outcomes in Y depend on the exposure. Specifically, within each cluster, outcomes in Y are generated as follows: a single exposure (say, the r-th variable in X) affects multiple outcomes, where ǫ ∼ N 3p (0 3p , � X ) and the effect sizes β s 's are generated from a Uniform(0, √ 25/m) (s = 1, . . . , ηp).
. . , β p X r1 , β p X r2 , β p X r3 t + ǫ, www.nature.com/scientificreports/ In addition to HSIC cl , we consider another HSIC-based test statistic HSIC mean , the original HSIC test with averaged observations at different time points for each cluster, i.e., The estimated power are presented in Fig. 2. Corresponding standard errors are provided in Supplementary D. We see that the new test outperforms HSIC mean in all cases. HSIC cl shows better performance than the new test when ρ c = 0.3 , but the performance decreases as ρ c increases. This is expected since averaging data at the cluster level will result in reduced evident information loss under high within-cluster correlation, while the new test keeps using the kernel information and still captures this signal. In addition, the new test works well under high dimensions.
In addition, we conduct power comparison between the new test and HSIC mean by the permutation test, Pearson type III approximation, and Davies' method, and results are provided in Supplementary B.
We also compare the computational cost of the new test with HSIC mean and HSIC cl and the results are shown in Table 4. We see that the new test is much faster than HSIC cl with good performance. HSIC mean is the fastest, but it has lower power than the new test.
Lastly, we also compare the performance of the new test to other existing independence tests, dCov 14 and HHG 35 that are based on the distance covariance and ranking of interpoint distances, respectively, and results are provided in Supplementary C. Analysis of MsFLASH data. As noted previously, the MsFLASH study was a randomized study of vaginal estrogen vs. two different placebos. To understand why the trial was negative, investigators were interested in studying whether microbiome is associated with metabolic pathways. Vaginal microbiota and vaginal fluid metabolites were characterized longitudinally and available in 141 participants 36 . For each arm, we have 45, 46, and 50 clusters (corresponding to a separate subject) with the equal cluster size 3 (corresponding to three clinical visits in which vaginal swabs were obtained). The vaginal microbiome profiles include abundance data of 381 taxa. The metabolome profiles comprise the abundance data of 171 metabolites that are grouped into 95 metabolic pathways. Across all 95 pathways, we conduct the association tests to detect the dependence between metabolites in each pathway and the overall vaginal microbiome compositions.
Here, we use the Gaussian kernel as well as the Bray-Curtis kernel that can be useful when the phylogenetic tree information is unavailable. For each test, the Bonferroni-corrected significance level is set to be 0.05/95 = 5.3 × 10 −4 . Table 5 shows the number of detected metabolic pathways associated with the vaginal microbiome composition. We see that the new method identifies a larger number of pathways than HSIC mean and HSIC cl for all cases, indicating the consistent improvement of the new test. In particular, the new test using the Gaussian kernel is more powerful than the Bray-Curtis kernel, indicating a possible non-linear relationship between some metabolites and microbial taxa abundances.
Collectively, these results indicate that for many key biological pathways, the link between the microbiota and metabolome remains in place and as expected. Thus, the failure of the MsFLASH trial may not result from a failure in this part of the hypothesis and additional work is needed to understand why the trial failed.

Discussion
We have introduced the new kernel-based test of independence for cluster-correlated data. The new approach is versatile and robust in that it avoids any parametric assumptions or settings. We have also proposed the analytic formulas for type I error control, offering easy off-the-shelf tools for large datasets. We have experimentally demonstrated that the new test exhibits superior power and work well particularly for high-dimensional settings with large within-cluster correlation.
As demonstrated, our approach is effective in assessing the generalized dependency between two sets of data when the samples are clustered. However, while our approach accommodates the correlation arising from the fact that multiple samples come from the same individual, we do not explicitly harness the longitudinal nature. www.nature.com/scientificreports/  www.nature.com/scientificreports/ Specifically, we primarily treat the samples as repeated measurements rather than true longitudinal profiles in assessing association. How to bring in the longitudinal structure remains a question of importance and a topic for further investigation. Our approach begins with pre-constructed kernel measures capturing pair-wise similarity in samples and is valid for any positive definite kernel metrics. However, kernel metrics that better capture the true relationship between the data will lead to improved power. Choosing an optimal kernel represents a general problem within the statistical learning literature. Some have proposed omnibus tests based on weighted averages of kernels, but this is a sub-optimal strategy since the HSIC statistics depend on the scale of the different kernels. A better solution is to move from the HSIC statistic to the p value scale with incorporation of permutation testing. However, this is again slow. One potential solution is to use the Cauchy-Combination method within this context 37 , but further evaluation is necessary.
A major contribution of this work is the computational efficiency of the proposed strategy which generalizes to both clustered and un-clustered data settings. The use of the Pearson type III approximation of the finite sample permutation distribution is fast which allows for accurate computation of tailed p values. For example, in an mbGWAS study looking at relationship between groups of SNPs and microbiome composition 38 , it is necessary to compute tens of thousands of tests and get p values at alpha levels as low as 10 −6 , for which permutation is infeasible. Consequently, the relevance and importance of our strategy will only continue to grow as such studies become more common.

Data availability
The data generated during and/or analyzed during the current study are available from the corresponding author on reasonable request.   www.nature.com/scientificreports/ Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http:// creat iveco mmons. org/ licen ses/ by/4. 0/.